Visual Reasoning AI News List

Time	Title	Details
2026-04-30 11:53	DeepSeek Visual Primitives Beat Giants	According to KyeGomezB, DeepSeek’s visual primitives let models point while reasoning, matching or beating GPT-5.4 and Claude Sonnet on visual QA.
2026-03-30 19:03	GPT-5.4 Pro Analysis: How ChatGPT Visually Interprets Scientific Figures for Faster Research Workflows	According to @emollick, ChatGPT GPT-5.4 Pro and the Thinking harness excel at reading scientific papers by identifying key figures and inspecting them visually, rather than relying only on text. As reported by Ethan Mollick on X, this visual reasoning enables the model to prioritize salient charts and diagrams, improving literature review speed and accuracy for R&D and competitive analysis. According to Mollick, these capabilities suggest practical applications in automated paper triage, figure-centric summarization, and hypothesis generation workflows for research teams and knowledge workers.
2026-03-13 15:00	Claude Visual Thinking Breakthrough: 5 Starter Prompts and Mastery Guide for 2026 Prompt Engineering	According to God of Prompt on X, Claude has added visual thinking capabilities and the team released a Claude Mastery Guide featuring prompt engineering principles tailored to Claude, 10+ tested mega-prompts, and advanced techniques most users miss, with details available at godofprompt.ai (source: God of Prompt tweet on Mar 13, 2026). As reported by the same source, the guide positions practitioners to leverage Claude’s multimodal reasoning through structured visual decomposition prompts, diagram-first instructions, and stepwise spatial reasoning, enabling faster UI wireframing, data chart interpretation, and workflow mapping for product and ops teams. According to God of Prompt, businesses can operationalize these prompts to accelerate requirements gathering, convert sketches to structured outputs, and standardize prompt libraries for customer support, design sprints, and analytics documentation, improving time-to-value and prompt reproducibility.
2026-01-29 16:41	Latest Agentic Vision Rollout in Gemini App: Enhanced Thinking Mode with Gemini 3 Flash	According to Google Gemini (@GeminiApp), Agentic Vision is now being integrated into the Gemini app, accessible when users select the 'Thinking' model option. This update, highlighted in Gemini 3 Flash, aims to deliver advanced reasoning and perception capabilities within the app. As reported by Google Gemini, this rollout is expected to enhance user experience for tasks requiring sophisticated visual and cognitive processing, opening new business opportunities for developers and enterprises leveraging the Gemini platform.
2025-11-26 11:09	Chain-of-Visual-Thought (COVT): Revolutionizing Visual Language Models with Continuous Visual Tokens for Enhanced Perception	According to @godofprompt, the new research paper 'Chain-of-Visual-Thought (COVT)' introduces a breakthrough method for Visual Language Models (VLMs) by enabling them to reason using continuous visual tokens instead of traditional text-based chains of thought. This approach allows models to generate mid-thought visual latents such as segmentation cues, depth maps, edges, and DINO features, effectively giving the model a 'visual scratchpad' for spatial and geometric reasoning. The results are significant: COVT models achieved a 14% improvement in depth reasoning, a 5.5% boost on CV-Bench, and major gains on HRBench and MMVP benchmarks. The technique is compatible with leading VLMs like Qwen2.5-VL and LLaVA, with interpretable visual tokens that can be decoded for transparency. Notably, the research finds that traditional text-only reasoning chains actually degrade visual reasoning performance, whereas COVT’s visual grounding improves accuracy in counting, spatial understanding, and 3D awareness, and reduces hallucinated outputs. These findings point to transformative business opportunities for AI solutions requiring fine-grained visual analysis, accurate object recognition, and reliable spatial intelligence, especially in fields like robotics, autonomous vehicles, and advanced multimodal search. (Source: @godofprompt, Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens, 2025)
2025-10-07 19:45	Google DeepMind Launches Gemini 2.5 Computer Use: Advanced AI Model Sets New Benchmark for Automated Web Browsing	According to Google DeepMind, the new Gemini 2.5 Computer Use model leverages advanced visual understanding and reasoning to enable AI agents to navigate browsers by clicking, scrolling, and typing as a human user would. This upgrade significantly enhances practical AI applications for automated online tasks, streamlining workflows in industries such as customer support, e-commerce, and data entry. The model outperforms previous versions on multiple industry benchmarks, offering improved speed and reliability, which positions it as a game-changer for businesses seeking to automate complex web-based operations (source: Google DeepMind, Twitter, Oct 7, 2025).
2025-06-11 17:00	Meta Unveils V-JEPA-v2: Advanced Self-Supervised Vision AI Model for Business Applications	According to Yann LeCun (@ylecun), Meta has released V-JEPA-v2, a new version of its self-supervised vision model designed to significantly improve visual reasoning and understanding without reliance on labeled data (source: @ylecun, June 11, 2025). V-JEPA-v2 leverages a joint embedding predictive architecture, enabling more efficient training and better generalization across varied visual tasks. This breakthrough is expected to drive business opportunities in industries such as autonomous vehicles, retail analytics, and healthcare imaging by lowering data annotation costs and accelerating deployment of AI-powered vision systems.
